Data Visualization

Author: Dr. Devan Becker
Affiliation: Wilfrid Laurier University

1 Preliminaries: Data and Data Types

1.1 Let’s make this quick!

Unlike in Beetlejuice, you only have to type an R object’s name once to make it appear.
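As a minimal sketch, assuming the penguins data comes from the palmerpenguins package (these notes do not say where it is loaded from):

```r
# Assumption: penguins is provided by the palmerpenguins package
library(palmerpenguins)

# Typing the object's name on its own line prints it to the screen
penguins
```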

For data sets, this prints a lot of information to the screen. We often use glimpse() instead:
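A short sketch of that call, assuming glimpse() is taken from the dplyr package and penguins from palmerpenguins:

```r
library(dplyr)           # assumption: glimpse() is loaded via dplyr
library(palmerpenguins)  # assumption: source of the penguins data

# One line per column: its name, its type (<fct>, <dbl>, ...), and the first few values
glimpse(penguins)
```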

Notice that there’s an R object called penguins, and there are columns called body_mass_g, flipper_length_mm, and species.

Also note that some columns are labelled “<fct>” and some are labelled “<dbl>”. If you look closely, you’ll notice that the <fct> variables are all words, and the <dbl> variables are all numbers.

  • Factor/Categorical: Only takes specific values.
    • Examples: Names, number of cylinders in a car engine.
      • If something is stored as a number, but is better treated as a factor, you may need the factor() function.
    • To be a factor variable, you must be able to list out all of the possible values ahead of time.
    • Often labelled <fct> in R.
  • Numeric: A number.
    • Often labelled <dbl> in R, short for “double precision”. This just means that the computer will be ready for a lot of decimal places if needed.
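To see the difference between the two types, here is a small sketch using made-up cylinder counts (the vector cylinders is hypothetical, not part of any data set in these notes):

```r
# Hypothetical cylinder counts, stored as plain numbers
cylinders <- c(4, 6, 8, 4, 6)
class(cylinders)        # "numeric"

# factor() converts them into a categorical variable
cylinders_fct <- factor(cylinders)
class(cylinders_fct)    # "factor"

# levels() lists every possible value: "4" "6" "8"
levels(cylinders_fct)
```

Because all possible values can be listed ahead of time (4, 6, or 8), treating this variable as a factor is reasonable.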

2 ggplot Basics

For now, these are sacred incantations which you will need to remember. As we learn more R, they’ll make more sense!

Practice, practice, practice. Keep a file that explains every piece of code as well as you can! If you come to my office saying that you studied for hours and still got a bad grade, I’m going to ask to see your code!

2.1 Building a ggplot2 work of art

  • ggplot() is the canvas
  • aes() are the paints
  • geom_*()s are the brush strokes

In the following code chunk, we’ll add each element one-by-one and see what happens!
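One way to sketch that build-up, assuming penguins comes from the palmerpenguins package and picking flipper_length_mm and body_mass_g as the two columns (any two numeric columns would do):

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data

# Step 1: the canvas. Just an empty grey panel.
ggplot(data = penguins)

# Step 2: the paints. aes() maps columns to x and y, so axes appear,
# but nothing is drawn yet.
ggplot(data = penguins,
       mapping = aes(x = flipper_length_mm, y = body_mass_g))

# Step 3: the brush strokes. geom_point() draws one point per row
# (rows with missing values are dropped with a warning).
penguin_plot <- ggplot(data = penguins,
                       mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()
penguin_plot
```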

Where do I put aes()?

There are lots of options! All three of the following make the exact same plot:
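Here is one possible set of three equivalent placements (a sketch; the column choices and the palmerpenguins source are assumptions):

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data

# Option 1: aes() inside ggplot(), shared by every geom
p1 <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

# Option 2: aes() added as its own piece, also shared by every geom
p2 <- ggplot(penguins) +
  aes(x = flipper_length_mm, y = body_mass_g) +
  geom_point()

# Option 3: aes() inside the geom, used by that geom only
p3 <- ggplot(penguins) +
  geom_point(aes(x = flipper_length_mm, y = body_mass_g))

p1
p2
p3
```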

We’ll see some cases where we set aesthetics separately later!


2.2 A quick geometry tour

Un-comment each of the lines (one at a time) and observe!
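A sketch of such a tour. The particular geoms below are an illustrative selection, not necessarily the ones used in class, and penguins is assumed to come from palmerpenguins:

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data

# The canvas and paints are shared; only the brush stroke changes
base_plot <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g))

base_plot + geom_point()
# base_plot + geom_line()
# base_plot + geom_smooth()
# base_plot + geom_bin_2d()
# base_plot + geom_density_2d()
```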

We can also add more than one geom to a plot. The following is the most common use-case:
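A likely candidate for that pairing is points plus a smoothed trend line. This sketch assumes that combination, and that penguins comes from palmerpenguins:

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data

# Layers are drawn in the order they are added:
# here the smooth line is painted on top of the points
points_then_smooth <- ggplot(penguins,
                             aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +
  geom_smooth()
points_then_smooth

# Swapping the order paints the points on top of the smooth line
smooth_then_points <- ggplot(penguins,
                             aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_smooth() +
  geom_point()
smooth_then_points
```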

Note that the plot looks slightly different if you change the order of the geoms!

2.3 Playing with aesthetics

Check the help file for geom_point(), and try each of the aesthetics!

Try using the columns labelled species (categorical/factor) and bill_length_mm (continuous) for each of them.
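As a starting point, here is a sketch that tries a few of the aesthetics geom_point() understands (the x and y columns are my choice, and penguins is assumed to come from palmerpenguins):

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data

base_plot <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g))

# Categorical column: one distinct colour (or shape) per species
base_plot + geom_point(aes(colour = species))
base_plot + geom_point(aes(shape = species))

# Continuous column: a colour gradient, or point size
base_plot + geom_point(aes(colour = bill_length_mm))
base_plot + geom_point(aes(size = bill_length_mm))

# shape cannot take a continuous column; un-comment to see the error
# base_plot + geom_point(aes(shape = bill_length_mm))
```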

Note that some geoms take different aesthetics, or just look better with different aesthetics:
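For instance, histograms are a case where fill usually looks better than colour. A sketch, assuming penguins from palmerpenguins and an arbitrary choice of 30 bins:

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data

by_mass <- ggplot(penguins, aes(x = body_mass_g))

# Bars are filled shapes, so fill colours the whole bar...
by_mass + geom_histogram(aes(fill = species), bins = 30)

# ...while colour only changes the outline of each bar
by_mass + geom_histogram(aes(colour = species), bins = 30)

# The default point shape has no interior to fill, so colour is the one to use
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(colour = species))
```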

2.4 Args and Aes

The colour can be set according to data (aes) or can be set for all of the data points (as an argument).
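A sketch of the two cases side by side (the colour name "darkorchid" is an arbitrary choice, and penguins is assumed to come from palmerpenguins):

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data

# Inside aes(): colour depends on the data, so a legend appears
mapped <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(colour = species))
mapped

# Outside aes(): colour is an argument, so every point gets the same colour
fixed <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(colour = "darkorchid")
fixed

# Common mistake: inside aes(), "blue" is treated as data (a single category),
# so the points are NOT drawn in blue. Un-comment to see it.
# ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
#   geom_point(aes(colour = "blue"))
```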

2.5 The fascinating facets of a plot

The facet_*() functions allow you to make different plots based on the values of a categorical variable.
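A minimal sketch with facet_wrap(), assuming penguins from palmerpenguins and using species as the categorical variable:

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data

# One panel per species, all sharing the same axes
by_species <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +
  facet_wrap(vars(species))
by_species
```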

What goes in the facet brackets?

You’ll notice that the textbook always uses vars(...) (vars is short for variables, not variances), whereas I will sometimes use ~. The vars(...) approach is the recommended one; the ~ approach might stop working one day, but it’s convenient for now.
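To compare the two notations, here is a sketch in which each pair should produce the same plot (island is my choice of column; penguins is assumed to come from palmerpenguins):

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data

base_plot <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

# facet_wrap(): vars() notation versus formula (~) notation
wrap_vars    <- base_plot + facet_wrap(vars(island))
wrap_formula <- base_plot + facet_wrap(~ island)

# facet_grid(): rows go on the left of ~, columns on the right
grid_vars    <- base_plot + facet_grid(rows = vars(island), cols = vars(species))
grid_formula <- base_plot + facet_grid(island ~ species)

wrap_vars
grid_vars
```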


3 Making Good Plots

To test your knowledge, we’ll use the mpg data set, which is loaded when you load the ggplot2 package. Play around with the geoms listed above to see what errors you get and which plots look the best!

  1. hwy (the fuel efficiency of the car on the highway, in miles per gallon) versus displ (the engine’s displacement, i.e. its size).
Solution
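One possible answer (a sketch, not necessarily the intended solution) is a scatterplot, since both variables are numeric:

```r
library(ggplot2)  # loading ggplot2 makes the mpg data set available

# "hwy versus displ": hwy on the y axis, displ on the x axis
hwy_displ <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()
hwy_displ
```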

  2. Highway fuel efficiency versus the classification of the car.
  3. Highway fuel efficiency versus the number of cylinders.
Solution

This one is tricky: the number of cylinders is stored as a number, but the plot looks much better if it is treated as a factor variable.
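A sketch of one way to do this, using a boxplot (the choice of geom is mine) and converting cyl with factor() inside aes():

```r
library(ggplot2)  # loading ggplot2 makes the mpg data set available

# cyl is stored as a whole number
class(mpg$cyl)

# Treated as a number, geom_boxplot() draws a single box and warns about grouping.
# Un-comment to compare:
# ggplot(mpg, aes(x = cyl, y = hwy)) + geom_boxplot()

# Treated as a factor, each cylinder count gets its own box
cyl_plot <- ggplot(mpg, aes(x = factor(cyl), y = hwy)) +
  geom_boxplot()
cyl_plot
```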


  4. The highway fuel efficiency versus the displacement of the engine for different numbers of cylinders.
Solution

There are two ways to do this: with aesthetics or with facets! Which one is easier to get insights from? Which one is prettier?
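A sketch of both approaches so you can compare them yourself (scatterplots are my choice of geom):

```r
library(ggplot2)  # loading ggplot2 makes the mpg data set available

# Way 1: aesthetics. One panel, with cylinders shown by colour.
with_colour <- ggplot(mpg, aes(x = displ, y = hwy, colour = factor(cyl))) +
  geom_point()
with_colour

# Way 2: facets. One panel per number of cylinders.
with_facets <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(vars(cyl))
with_facets
```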